Search for: All records

Creators/Authors contains: "Vukotic, Ilija"

Investigating Data Access Models for ATLAS: A Case Study with FABRIC Across Borders and ServiceX

https://doi.org/10.1051/epjconf/202533701117

Vukotic, Ilija; Hu, Fengping; Bryant, Lincoln; Gardner Jr, Robert William; McKee, Shawn; Stephen, Judith; Jordan, David (October 2025, EPJ Web of Conferences)
Szumlak, T; Rachwał, B; Dziurda, A; Schulz, M; vom Bruch, D; Ellis, K; Hageboeck, S (Ed.)
This study explores enhancements in analysis speed, WAN bandwidth efficiency, and data storage management through an innovative data access strategy. The proposed model introduces specialized ‘delivery’ services for data preprocessing, which include filtering and reformatting tasks executed on dedicated hardware located alongside the data repositories at CERN’s Tier-0, Tier-1, or Tier-2 facilities. Positioned near the source storage, these services are crucial for limiting redundant data transfers and focus on sending only vital data to distant analysis sites, aiming to optimize network and storage use at those sites. Within the scope of the NSF-funded FABRIC Across Borders (FAB) initiative, we assess this model using an “in-network, edge” computing cluster at CERN, outfitted with substantial processing capabilities (CPU, GPU, and advanced network interfaces). This edge computing cluster features dedicated network peering arrangements that link CERN Tier-0, the FABRIC experimental network, and an analysis center at the University of Chicago, creating a solid foundation for our research. Central to our infrastructure is ServiceX, an R&D software project under the Data Organization, Management, and Access (DOMA) group of the Institute for Research and Innovation in Software for High Energy Physics - IRIS-HEP. ServiceX is a scalable filtering and reformatting service, designed to operate within a Kubernetes environment and deliver output to an S3 object store at an analysis facility. Our study assesses the impact of server-side delivery services in augmenting the existing HEP computing model, particularly evaluating their possible integration within the broader WAN infrastructure. This model could empower Tier-1 and Tier-2 centers to become efficient data distribution nodes, enabling a more cost-effective way to disseminate data to analysis sites and object stores, thereby improving data access and efficiency. 
This research is experimental and serves as a demonstrator of the capabilities and improvements that such integrated computing models could offer in the HL-LHC era.
Free, publicly-accessible full text available October 7, 2026
Building Scalable Analysis Infrastructure for ATLAS

https://doi.org/10.1051/epjconf/202533701062

Bryant, Lincoln; Gardner Jr, Robert William; Golnaraghi, Farnaz; Hu, Fengping; Jordan, David; Lancon, Eric Christian; Rosberg, Aidan; Stephen, Judith; Taylor, Ryan Paul; Vukotic, Ilija (October 2025, EPJ Web of Conferences)
Szumlak, T; Rachwał, B; Dziurda, A; Schulz, M; vom Bruch, D; Ellis, K; Hageboeck, S (Ed.)
We explore the adoption of cloud-native tools and principles to forge flexible and scalable infrastructures, aimed at supporting analysis frameworks being developed for the ATLAS experiment in the High Luminosity Large Hadron Collider (HL-LHC) era. The project culminated in the creation of a federated platform, integrating Kubernetes clusters from various providers such as Tier-2 centers, Tier-3 centers, and from the IRIS-HEP Scalable Systems Laboratory, a National Science Foundation project. A unified interface was provided to streamline the management and scaling of containerized applications. Enhanced system scalability was achieved through integration with analysis facilities, enabling spillover of Jupyter/Binder notebooks and Dask workers to Tier-2 resources. We investigated flexible deployment options for a “stretched” (over the wide area network) cluster pattern, including a centralized “lights out management” model, remote administration of Kubernetes services, and a fully autonomous site-managed cluster approach, to accommodate varied operational and security requirements. The platform demonstrated its efficacy in multi-cluster demonstrators for low-latency analyses and advanced workflows with tools such as Coffea, ServiceX, Uproot and Dask, and RDataFrame, illustrating its ability to support various processing frameworks. The project also resulted in a robust user training infrastructure for ATLAS software and computing on-boarding events.
Free, publicly-accessible full text available October 7, 2026
Operating the 200 Gbps IRIS-HEP Demonstrator for ATLAS

https://doi.org/10.1051/epjconf/202533701061

Gardner Jr, Robert W; Benjamin, Douglas; Bryant, Lincoln; Feickert, Matthew; Golnaraghi, Farnaz; Held, Alexander; Hu, Fengping; Jordan, David; Stephen, Judith; Vukotic, Ilija; et al (October 2025, EPJ Web of Conferences)
Szumlak, T; Rachwał, B; Dziurda, A; Schulz, M; vom Bruch, D; Ellis, K; Hageboeck, S (Ed.)
The ATLAS experiment is currently developing columnar analysis frameworks which leverage the Python data science ecosystem. We describe the construction and operation of the infrastructure necessary to support demonstrations of these frameworks, with a focus on those from IRIS-HEP. One such demonstrator aims to process the compact ATLAS data format PHYSLITE at rates exceeding 200 Gbps. Various access configurations and setups on different sites are explored, including direct access to a dCache storage system via Xrootd, the use of ServiceX, and the use of multiple XCache servers equipped with NVMe storage devices. Integral to this study was the analysis of network traffic and bottlenecks, worker node scheduling and disk configurations, and the performance of an S3 object store. The system’s overall performance was measured as the number of processing cores scaled to over 2,000 and the volume of data accessed in an interactive session approached 200 TB. The presentation will delve into the operational details and findings related to the physical infrastructure that underpins these demonstrators.
Free, publicly-accessible full text available October 7, 2026
The 200 Gbps Challenge: Imagining HL-LHC analysis facilities

https://doi.org/10.1051/epjconf/202533701217

Held, Alexander; Albin, Sam; Attebury, Garhan; Bloom, Kenneth; Bockelman, Brian; Bryant, Lincoln; Choi, Kyungeon; Cranmer, Kyle; Elmer, Peter; Feickert, Matthew; et al (October 2025, EPJ Web of Conferences)
Szumlak, T; Rachwał, B; Dziurda, A; Schulz, M; vom Bruch, D; Ellis, K; Hageboeck, S (Ed.)
The IRIS-HEP software institute, as a contributor to the broader HEP Python ecosystem, is developing scalable analysis infrastructure and software tools to address the upcoming HL-LHC computing challenges with new approaches and paradigms, driven by our vision of what HL-LHC analysis will require. The institute uses a “Grand Challenge” format, constructing a series of increasingly large, complex, and realistic exercises to show the vision of HL-LHC analysis. Recently, the focus has been demonstrating the IRIS-HEP analysis infrastructure at scale and evaluating technology readiness for production. As a part of the Analysis Grand Challenge activities, the institute executed a “200 Gbps Challenge”, aiming to show sustained data rates into the event processing of multiple analysis pipelines. The challenge integrated teams internal and external to the institute, including operations and facilities, analysis software tools, innovative data delivery and management services, and scalable analysis infrastructure. The challenge showcases the prototypes — including software, services, and facilities — built to process around 200 TB of data in both the CMS NanoAOD and ATLAS PHYSLITE data formats with test pipelines. The teams were able to sustain the 200 Gbps target across multiple pipelines. The pipelines focusing on event rate were able to process at over 30 MHz. These target rates are demanding; the activity revealed considerations for future testing at this scale and changes necessary for physicists to work at this scale in the future. The 200 Gbps Challenge has established a baseline on today’s facilities, setting the stage for the next exercise at twice the scale.
Free, publicly-accessible full text available October 7, 2026
TRACER (TRACe route ExploRer): A tool to explore OSG/WLCG network route topologies

https://doi.org/10.1142/S0217751X21300052

Tretyakov, Evgeniy; Artamonov, Alexey; Grigorieva, Maria; Klimentov, Alexei; McKee, Shawn; Vukotic, Ilija (February 2021, International Journal of Modern Physics A)
The experiments at the Large Hadron Collider (LHC) rely upon a complex distributed computing infrastructure (WLCG) consisting of hundreds of individual sites worldwide at universities and national laboratories, providing about half a million computing job slots and an exabyte of storage interconnected through high-speed networks. Wide Area Networking (WAN) is one of the three pillars (together with computational resources and storage) of LHC computing. More than 5 PB/day are transferred between WLCG sites. Monitoring is one of the crucial components of WAN and experiment operations. In recent years all experiments have invested significant effort to improve monitoring and integrate networking information with data management and workload management systems. All WLCG sites are equipped with perfSONAR servers to collect a wide range of network metrics. We will present the latest development to provide a 3D force-directed graph visualization for data collected by perfSONAR. The visualization package allows site admins, network engineers, scientists and network researchers to better understand the topology of our Research and Education networks, and it provides the ability to identify unreliable and/or suboptimal network paths, such as those with routing loops or rapidly changing routes.
Full Text Available
Towards a NoOps Model for WLCG

https://doi.org/10.1051/epjconf/202024507024

Gardner, Robert; Bryant, Lincoln; Stephen, Judith; Vukotic, Ilija; Weaver, Christopher; Wu, Wenjing (November 2020, 24th International Conference on Computing in High Energy and Nuclear Physics (CHEP 2019))
One of the most costly factors in providing a global computing infrastructure such as the WLCG is the human effort in deployment, integration, and operation of the distributed services supporting collaborative computing, data sharing and delivery, and analysis of extreme scale datasets. Furthermore, the time required to roll out global software updates, introduce new service components, or prototype novel systems requiring coordinated deployments across multiple facilities is often increased by communication latencies, staff availability, and in many cases expertise required for operations of bespoke services. While the WLCG (and distributed systems implemented throughout HEP) is a global service platform, it lacks the capability and flexibility of a modern platform-as-a-service including continuous integration/continuous delivery (CI/CD) methods, development-operations capabilities (DevOps, where developers assume a more direct role in the actual production infrastructure), and automation. Most importantly, tooling which reduces required training, bespoke service expertise, and the operational effort throughout the infrastructure, most notably at the resource endpoints (sites), is entirely absent in the current model. In this paper, we explore ideas and questions around potential NoOps models in this context: what is realistic given organizational policies and constraints? How should operational responsibility be organized across teams and facilities? What are the technical gaps? What are the social and cybersecurity challenges? Conversely what advantages does a NoOps model deliver for innovation and for accelerating the pace of delivery of new services needed for the HL-LHC era? We will describe initial work along these lines in the context of providing a data delivery network supporting IRIS-HEP DOMA R&D.
Full Text Available
WLCG Networks: Update on Monitoring and Analytics

https://doi.org/10.1051/epjconf/202024507053

Babik, Marian; McKee, Shawn; Andrade, Pedro; Bockelman, Brian Paul; Gardner, Robert; Fajardo Hernandez, Edgar Mauricio; Martelli, Edoardo; Vukotic, Ilija; Weitzel, Derek; Zvada, Marian (January 2020, EPJ Web of Conferences)
Doglioni, C.; Kim, D.; Stewart, G.A.; Silvestris, L.; Jackson, P.; Kamleh, W. (Ed.)
WLCG relies on the network as a critical part of its infrastructure and therefore needs to guarantee effective network usage and prompt detection and resolution of any network issues including connection failures, congestion and traffic routing. The OSG Networking Area, in partnership with WLCG, is focused on being the primary source of networking information for its partners and constituents. It was established to ensure sites and experiments can better understand and fix networking issues, while providing an analytics platform that aggregates network monitoring data with higher level workload and data transfer services. This has been facilitated by the global network of the perfSONAR instances that have been commissioned and are operated in collaboration with WLCG Network Throughput Working Group. An additional important update is the inclusion of the newly funded NSF project SAND (Service Analytics and Network Diagnosis) which is focusing on network analytics. This paper describes the current state of the network measurement and analytics platform and summarises the activities taken by the working group and our collaborators. This includes the progress being made in providing higher level analytics, alerting and alarming from the rich set of network metrics we are gathering.
Full Text Available
StashCache: A Distributed Caching Federation for the Open Science Grid

https://doi.org/10.1145/3332186.3332212

Weitzel, Derek; Zvada, Marian; Vukotic, Ilija; Gardner, Rob; Bockelman, Brian; Rynge, Mats; Hernandez, Edgar Fajardo; Lin, Brian; Selmeci, Mátyás (January 2019, Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning) (PEARC ’19). ACM, New York, NY, USA, Article 58, 7 pages.)

Data distribution for opportunistic users is challenging as they own neither the computing resources they are using nor any nearby storage. Users are motivated to use opportunistic computing to expand their data processing capacity, but they require storage and fast networking to distribute data to that processing. Since it requires significant management overhead, it is rare for resource providers to allow opportunistic access to storage. Additionally, in order to use opportunistic storage at several distributed sites, users assume the responsibility to maintain their data. In this paper we present StashCache, a distributed caching federation that enables opportunistic users to utilize nearby opportunistic storage. StashCache comprises four components: data origins, redirectors, caches, and clients. StashCache has been deployed in the Open Science Grid for several years and has been used by many projects. Caches are deployed in geographically distributed locations across the U.S. and Europe. We will present the architecture of StashCache, as well as utilization information of the infrastructure. We will also present a performance analysis comparing distributed HTTP proxies with StashCache.
Full Text Available
Building the SLATE Platform

https://doi.org/10.1145/3219104.3219144

Breen, Joe; McKee, Shawn; Riedel, Benedikt; Stidd, Jason; Truong, Luan; Vukotic, Ilija; Bryant, Lincoln; Carcassi, Gabriele; Chen, Jiahui; Gardner, Robert W.; et al (July 2018, Proceedings of the Practice and Experience on Advanced Research Computing)

We describe progress on building the SLATE (Services Layer at the Edge) platform. The high level goal of SLATE is to facilitate creation of multi-institutional science computing systems by augmenting the canonical Science DMZ pattern with a generic, "programmable", secure and trusted underlayment platform. This platform permits hosting of advanced container-centric services needed for higher-level capabilities such as data transfer nodes, software and data caches, workflow services and science gateway components. SLATE uses best-of-breed data center virtualization and containerization components, and where available, software defined networking, to enable distributed automation of deployment and service lifecycle management tasks by domain experts. As such it will simplify creation of scalable platforms that connect research teams, institutions and resources to accelerate science while reducing operational costs and development cycle times.
Full Text Available
Managing Privilege and Access on Federated Edge Platforms

https://doi.org/10.1145/3332186.3332234

Breen, Joe; Bryant, Lincoln; Chen, Jiahui; Ford, Emerson; Gardner, Robert W.; Glupker, Gage; Griffith, Skyler; Kulbertis, Ben; McKee, Shawn; Pierce, Rose; et al (January 2019, Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning))

Full Text Available
